This paper proposes a self-supervised approach to learn universal facial representations from videos that transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available, non-annotated, web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin, to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate that MARLIN is an excellent facial video encoder and feature extractor that performs consistently well across a variety of downstream tasks including FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), and LS (29.36% gain in Fréchet Inception Distance), even in low-data regimes. Our code and pre-trained models will be made public.
We propose a multi-layer variational autoencoder method, which we call HR-VQVAE, that learns hierarchical discrete representations of the data. By utilizing a novel objective function, each layer in HR-VQVAE learns a discrete representation of the residual from previous layers through a quantized encoding. Furthermore, the representations at each layer are hierarchically linked to those of the previous layers. We evaluate our method on the tasks of image reconstruction and generation. Experimental results show that the discrete representations learned by HR-VQVAE enable the decoder to reconstruct high-quality images with less distortion than baseline methods, namely VQVAE and VQVAE-2. HR-VQVAE can also generate high-quality and diverse images that outperform state-of-the-art generative models, further validating the efficiency of the learned representations. The hierarchical nature of HR-VQVAE i) reduces decoding time, making the method particularly suitable for high-load tasks, and ii) allows increasing the codebook size without incurring the codebook collapse problem.
Given our growing online presence and information intake, realistic fake videos are a potential tool for spreading harmful misinformation. This paper presents a multi-modal learning-based method for detecting real and fake videos. The method combines information from three modalities: audio, video, and physiology. We investigate two strategies for combining the video and physiology modalities: either by augmenting the video with information from physiology, or by novelly learning the fusion of the two modalities with a proposed graph convolutional network architecture. Both strategies rely on a new method for generating visual representations of physiological signals. The detection of real and fake videos is then based on the dissimilarity between the audio and modified video modalities. The proposed method is evaluated on two benchmark datasets, and the results show a significant increase in detection performance compared to previous methods.